Skip to content

ENH: Optimize nrows in read_excel #35974

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 97 commits into from
Sep 21, 2020

Conversation

MarcoGorelli
Copy link
Member

@MarcoGorelli MarcoGorelli commented Aug 29, 2020

based on #33281


output of asv benchmarks:

(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev/asv_bench$ asv continuous -f 1.1 upstream/master optimise-nrows-excel -b excel.ReadExcel
· Creating environments..................................................................................................................................
· Discovering benchmarks
·· Uninstalling from conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building d0a8a687 <optimise-nrows-excel> for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....................................
·· Installing d0a8a687 <optimise-nrows-excel> into conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit c413df6d <master> (round 1/2):
[  0.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt....................................
[  0.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 12.50%] ··· Setting up io.excel:62                                                                                                                                                                               ok
[ 12.50%] ··· Running (io.excel.ReadExcel.time_read_excel--)..
[ 25.00%] · For pandas commit d0a8a687 <optimise-nrows-excel> (round 1/2):
[ 25.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 25.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 37.50%] ··· Setting up io.excel:62                                                                                                                                                                               ok
[ 37.50%] ··· Running (io.excel.ReadExcel.time_read_excel--)..
[ 50.00%] · For pandas commit d0a8a687 <optimise-nrows-excel> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· Setting up io.excel:62                                                                                                                                                                               ok
[ 62.50%] ··· io.excel.ReadExcel.time_read_excel                                                                                                                                                                   ok
[ 62.50%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      953±6ms   
               openpyxl   1.66±0.03s 
                 odf      6.02±0.02s 
              ========== ============

[ 75.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                                                                                                                                             ok
[ 75.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      878±20ms  
               openpyxl   1.67±0.02s 
                 odf      4.58±0.04s 
              ========== ============

[ 75.00%] · For pandas commit c413df6d <master> (round 2/2):
[ 75.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 87.50%] ··· Setting up io.excel:62                                                                                                                                                                               ok
[ 87.50%] ··· io.excel.ReadExcel.time_read_excel                                                                                                                                                                   ok
[ 87.50%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      941±5ms   
               openpyxl   1.69±0.02s 
                 odf      6.15±0.04s 
              ========== ============

[100.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                                                                                                                                             ok
[100.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      971±20ms  
               openpyxl   1.69±0.01s 
                 odf      6.07±0.03s 
              ========== ============

       before           after         ratio
     [c413df6d]       [d0a8a687]
     <master>         <optimise-nrows-excel>
-        971±20ms         878±20ms     0.90  io.excel.ReadExcel.time_read_excel_nrows('xlrd')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@@ -453,7 +491,20 @@ def parse(
else: # assume an integer if not a string
sheet = self.get_sheet_by_index(asheetname)

data = self.get_sheet_data(sheet, convert_float)
get_sheet_data_header = 0 if header is None else header
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required? Looks like there are already checks within the subsequent functions for int values, no?


for i in range(sheet_nrows):
if self.should_skip_row(i, header, skiprows, nrows):
data.append([])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to append the empty list here? Would be preferable to just continue

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed should_skip_row entirely, I think the benefit from the original PR came from not reading all the rows into memory, rather than from optimising skipping rows

@WillAyd WillAyd added the IO Excel read_excel, to_excel label Sep 2, 2020
@MarcoGorelli MarcoGorelli marked this pull request as ready for review September 4, 2020 15:41
@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Sep 4, 2020

Performance benchmarks:

(pandas-dev) marco@marco-Predator-PH315-52:~/pandas-dev/asv_bench$ asv continuous -f 1.1 upstream/master optimise-nrows-excel -b excel.ReadExcel
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Building 2d9ee8de <optimise-nrows-excel> for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.................................
·· Installing 2d9ee8de <optimise-nrows-excel> into conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.
· Running 4 total benchmarks (2 commits * 1 environments * 2 benchmarks)
[  0.00%] · For pandas commit b53dc8f8 <optimise-nrows-excel^2> (round 1/2):
[  0.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt.....................................
[  0.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 12.50%] ··· Setting up io.excel:62                                                                                                                                                                    ok
[ 12.50%] ··· Running (io.excel.ReadExcel.time_read_excel--).
[ 25.00%] ··· Running (io.excel.ReadExcel.time_read_excel_nrows--).
[ 25.00%] · For pandas commit 2d9ee8de <optimise-nrows-excel> (round 1/2):
[ 25.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 25.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 37.50%] ··· Setting up io.excel:62                                                                                                                                                                    ok
[ 37.50%] ··· Running (io.excel.ReadExcel.time_read_excel--).
[ 50.00%] ··· Running (io.excel.ReadExcel.time_read_excel_nrows--).
[ 50.00%] · For pandas commit 2d9ee8de <optimise-nrows-excel> (round 2/2):
[ 50.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 62.50%] ··· Setting up io.excel:62                                                                                                                                                                    ok
[ 62.50%] ··· io.excel.ReadExcel.time_read_excel                                                                                                                                                        ok
[ 62.50%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd     1.01±0.02s 
               openpyxl   1.85±0.02s 
                 odf      6.04±0.03s 
              ========== ============

[ 75.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                                                                                                                                  ok
[ 75.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      877±8ms   
               openpyxl   1.83±0.01s 
                 odf      4.57±0.04s 
              ========== ============

[ 75.00%] · For pandas commit b53dc8f8 <optimise-nrows-excel^2> (round 2/2):
[ 75.00%] ·· Building for conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt..
[ 75.00%] ·· Benchmarking conda-py3.8-Cython0.29.16-jinja2-matplotlib-numba-numexpr-numpy-odfpy-openpyxl-pytables-pytest-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 87.50%] ··· Setting up io.excel:62                                                                                                                                                                    ok
[ 87.50%] ··· io.excel.ReadExcel.time_read_excel                                                                                                                                                        ok
[ 87.50%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      974±20ms  
               openpyxl   1.82±0.04s 
                 odf      6.21±0.1s  
              ========== ============

[100.00%] ··· io.excel.ReadExcel.time_read_excel_nrows                                                                                                                                                  ok
[100.00%] ··· ========== ============
                engine               
              ---------- ------------
                 xlrd      960±9ms   
               openpyxl    1.79±0s   
                 odf      5.87±0.01s 
              ========== ============

       before           after         ratio
     [b53dc8f8]       [2d9ee8de]
     <optimise-nrows-excel^2>       <optimise-nrows-excel>
-      5.87±0.01s       4.57±0.04s     0.78  io.excel.ReadExcel.time_read_excel_nrows('odf')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

The OP mentioned a memory leak from skipping rows in openpyxl, which I experienced too, so I've skipped optimising those files for now

@jreback jreback added this to the 1.2 milestone Sep 13, 2020
@jreback jreback added the Performance Memory or execution speed performance label Sep 13, 2020
@jreback
Copy link
Contributor

jreback commented Sep 13, 2020

looks good, i thought this would make a much bigger difference, but ok.

cc @WillAyd

@WillAyd WillAyd merged commit e975f3d into pandas-dev:master Sep 21, 2020
@WillAyd
Copy link
Member

WillAyd commented Sep 21, 2020

Thanks @MarcoGorelli

elif skiprows is None:
skiprows_nrows = 0
else:
skiprows_nrows = len(skiprows)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like its causing failures on master

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2020

@MarcoGorelli we are reverting this as something is broken in the evaluation here. If you can resubmit when you can.

jreback pushed a commit that referenced this pull request Sep 22, 2020
@MarcoGorelli
Copy link
Member Author

Sure - I'm really sorry for the breakage caused, but glad this was caught early!

@MarcoGorelli MarcoGorelli deleted the optimise-nrows-excel branch September 22, 2020 17:00
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Excel read_excel, to_excel Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_excel opimize nrows
5 participants